1 Introduction

[ThG: I have edited/commented sections 1-4. Overall I think you should (1) provide a simple illustration in the introduction that summarizes all 3 methods (annot, manual, co-clustring). Figure 5 can then be removed since it is hard to follow and contains too much information relating to the supplement which we wanted to remove from this vignette. (2) Further please reduce the number of code chunks to make things easier for the user. For clarity you could use meta functions and summarize in the text what they are doing. Often you include the individual steps but then there is no explanation given what they are doning. (3) The optimization part needs to be entirely removed from the main part of this vignette as there is no context given what its relevance is. (4) Please also simplify the Supplement section 5 or remove it from the vignette and perhaps include it in a script file that is then just referenced for expert users in the text. Once this part 5 has been improved I will fine edit it. (5) In addition, please address my comments given in red font throughout the text.]

1.1 Overview

The primary utility of the spatialHeatmap package is the generation of spatial heatmaps (SHM) for visualizing cell-, tissue- and organ-specific abundance patterns of biological molecules in anatomical images (J. Zhang et al. 2022). This is useful for identifying genes with spatially enriched (SE) expression patterns as well as clusters and/or network modules composed of genes sharing similar expression patterns. These functionalities are introduced in the main vignette of the spatialHeatmap package. The following describes extended functionalities for integrating bulk tissue with single cell data by co-visualizing them in composite plots that combine spatial heatmaps with embedding plots of high-dimensional data. The resulting spatial context information is important for gaining insights into the tissue-level organiztion of single cell data.

The required quantitative assay data, such as gene expression values, can be provided in a variety of widely used tabular data structures (e.g. data.frame, SummarizedExperiment or SingleCellExperiment). The corresponding anatomic images need to be supplied as annotated SVG (aSVG) images. The creation of aSVGs is described in the main vignette of this package. For the embedding plots of single cell data, several dimensionality reduction algorithms (e.g. PCA, UMAP or tSNE) are supported. To associate single cells with their source tissues, the user can choose among three major methods including annotation-based, manual and automated methods (Figure 1). Similar to other functionalities in spatialHeatmap, these functionalities are available within R as well as the corresponding Shiny app (Chang et al. 2021, Chang and Borges Ribeiro (2018)).

1.2 Methods for Associating Cells and Bulk Tissues

The main idea behind co-visualizing bulk tissues and single cells (Figure 1) is first assigning individual cells with group labels, then aggregating cells within each group by taking mean or median of assay profiles, and finally associating aggregated cell groups to aSVG spatial features (bulk tissues).

Figure 1.1 is the standard single cell data container SingleCellExperiment (Amezquita et al. 2020). The assays, colData, rowData, and reducedDims slots contain experssion profiles, cell metadata, molecule metadata, and reduced dimensionalities produced by algorithms of PCA, TSNE, or UMAP (A. T. L. Lun, McCarthy, and Marioni 2016, McCarthy et al. (2017)) respectively. In co-visualization, the cell group labels are stored in the colData slot as shown in Figure 1.1. Three methods are available for obtaining cell labels. The annotation-based (Figure 1.2a) and manual methods (Figure 1.2b) are similar, since the main difference is how the cell labels are generated. In the annotation-based method, existing cell annotation labels are usually available and can be stored in a suitable format such as in a SingleCellExperiment object. For example, prepopulated instances of these containers are provided by the scRNAseq package (Risso and Cole 2022). The manual method allows users to create the cell to tissue associations one-by-one or importing them from a tabular file. In contrast to this, the automatic method aims to assign single cells to the corresponding source tissue computationally by a co-clustering algorithm (Figure 1.2c).

Cells with the same group label are aggregated by taking mean or median of assay profiles (Figure 1.3). Cells before aggregation are colored by their within-group aggregated assay profiles in an embedding plot created on reduced dimensionalities (A. T. L. Lun, McCarthy, and Marioni 2016, McCarthy et al. (2017)). These cell group colors are used to fill aSVG spatial features (bulk tissues) that are matched according to cell labels. The ouput is a co-visualization plot as shown in Figure 1.4.

The most common inputs are raw count data (e.g. RNA-seq) of chosen bulk tissues and single cells of the same organ (Figure 1.1), and the single cells come from the whole organ or at least cover the chosen bulk tissues. Bulk and single cell data are filtered to remove the most noisy data.1 No need to describe the processing and challenges of single cell data here. The main difference between bulk and single cells is the sparsity in the latter. To reduce such difference, the bulk and single-cell data are combined and normalized together, and subsequently separated 1.2).2 I don't understand the combined and separated part? Next, dimensionality reduction is performed on single-cell data using PCA or UMAP method, where one resultant dimensionality is equivalent to one molecule (e.g. gene). The top dimensionalities are clustered (Figure 1.4) to generate single cell clusters. Each cell cluster is refined by removing cells having low similarities with other cells in the same cluster (Figure 1.5). As a result, each cell cluster is more homogeneous (Figure 1.6). Next, bulk and single cell data are combined and co-clustered (Figure 1.7) to produce co-clusters.3 The text implies that the clustering is performed on the embedding data of the single cell data. However, what data is used for the bulk is unclear.

There are three types of co-clusters: 1) Multiple bulk tissues are co-clustered with many cells.4 I don't understand the part with the three types of clusters. To me it seems you are describing something that is too trivial to be mentioned, the name choice clusters is confusing or I am missing your point in general. Spearman's correlation coefficient (similarity) is calculated for any bulk-cell pair in the same co-cluster. If a bulk has the largest similarity with a specific cell than any other bulk, this bulk is assigned to this cell as source bulk. For example, in Figure 1.8a bulk A is assigned to cell a1 because a1 has higher similarity with A than B; 2). Only one bulk tissue is co-clustered with many cells. This bulk is assigned to all the cells in the same co-cluster (1.8b); and 3) No bulk is present in a cluster. All these cells are not considered for downstream co-visualization (Figure 1.8c), which are potential novel cell populations. After the automatic bulk assignments to cells, cells with the same bulk label are aggregated in the same way as annotation-based or manual method and subsequently the aggregated assay profiles are mapped to aSVG features.

[ThG: the above text, the illustration and the image legend need to be updated. I tried to follow but there are too many things I don't understand. Please substantiallyi simplify both the figure and the text.]

Overview of co-visualization. (1) Single cell data are stored in a `SingleCellExperiment` object, where assay data, cell metadata, molecule metadata, and reduce dimensionalities (PCA, TSNE, UMAP) are stored in the `assay`, `colData`, `rowData`, and `reducedDims` slots respectively. The cell labels are stored in the `colData` slot. (2) Three methods are provided for obtaining cell labels: annotation-based, manual, and automatic. (3) Cells are aggregated by taking mean or median of assay profiles under the same label. (4) Aggregated assay profiles are used to color the corresponding cells with the same label in an embedding plot and the aSVG spatial features that are matched by cell labels. The output is a co-visualization plot. ^[ThG: a description of the itemized steps 1-8 is missing. Each or at least ranges of steps need to be mentioned in the legend. In addtion, the illustration includes itemization with letters only for the right part. This needs to be done consitently. Moreover, the Annotation and Manual parts are essentially identical and their data structure is the same as under step 8 of the Automated method. This means you can much better summarize things by including this table once. This would be much more economic and less confusing since you would avoid a lot of duplication you have in the illustration. Moreover, the preprocessing part can be simplified, no one will understand the 'separating' part (see comment above).]

Figure 1: Overview of co-visualization
(1) Single cell data are stored in a SingleCellExperiment object, where assay data, cell metadata, molecule metadata, and reduce dimensionalities (PCA, TSNE, UMAP) are stored in the assay, colData, rowData, and reducedDims slots respectively. The cell labels are stored in the colData slot. (2) Three methods are provided for obtaining cell labels: annotation-based, manual, and automatic. (3) Cells are aggregated by taking mean or median of assay profiles under the same label. (4) Aggregated assay profiles are used to color the corresponding cells with the same label in an embedding plot and the aSVG spatial features that are matched by cell labels. The output is a co-visualization plot.5 ThG: a description of the itemized steps 1-8 is missing. Each or at least ranges of steps need to be mentioned in the legend. In addtion, the illustration includes itemization with letters only for the right part. This needs to be done consitently. Moreover, the Annotation and Manual parts are essentially identical and their data structure is the same as under step 8 of the Automated method. This means you can much better summarize things by including this table once. This would be much more economic and less confusing since you would avoid a lot of duplication you have in the illustration. Moreover, the preprocessing part can be simplified, no one will understand the 'separating' part (see comment above).

2 Getting Started

2.1 Installation

The spatialHeatmap package can be installed with the BiocManager::install command.

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("spatialHeatmap")

2.2 Packages and Documentation

Next, the packages required for running the sample code in this vignette need to be loaded.

library(spatialHeatmap); library(SummarizedExperiment); library(scran); library(scater); library(igraph); library(SingleCellExperiment); library(BiocParallel)

The following lists the vignette(s) of this package in an HTML browser. Clicking the name of the corresponding vignette will open it.

browseVignettes('spatialHeatmap')

To reduce runtime, intermediate results can be cached under ~/.cache/shm.

cache.pa <- '~/.cache/shm' # Set path of the cache directory

2.3 Quick Start

To obtain reproducible results, a fixed seed is set for generating random numbers.

set.seed(10)

A single cell test data set is loaded using mockSCE from the scuttle package (McCarthy et al. 2017). The data are normalized and log2-transformed (logNormCounts), and subsequently processed by dimensionality reduction methods including PCA, TSNE, and UMAP (reduce_dim).

sce.test <- mockSCE() 
sce.norm.test <- logNormCounts(sce.test)
sce.dimred.test <- reduce_dim(sce.norm.test) 
## "prop" is set 1 in "getTopHVGs" due to too less genes.

The downstream co-visualization is demonstrated on the cell cycle labels (Cell_Cycle) including G1, S, G0 and G2M.

colData(sce.dimred.test)[1:2, ]
## DataFrame with 2 rows and 4 columns
##          Mutation_Status  Cell_Cycle   Treatment sizeFactor
##              <character> <character> <character>  <numeric>
## Cell_001        positive          G1      treat2   1.029846
## Cell_002        positive          G1      treat2   0.952431
unique(colData(sce.dimred.test)$Cell_Cycle)
## [1] "G1"  "S"   "G0"  "G2M"

To visualize single cell data in spatial heatmaps, the single cell expression values are aggregated by their source tissues using common summary statistics such as mean or median.

sce.aggr.test <- aggr_rep(sce.dimred.test, assay.na='logcounts', sam.factor='Cell_Cycle', con.factor=NULL, aggr='mean')
colData(sce.aggr.test)
## DataFrame with 4 rows and 4 columns
##     Mutation_Status  Cell_Cycle   Treatment sizeFactor
##         <character> <character> <character>  <numeric>
## G1         positive          G1      treat2   1.029846
## S          positive           S      treat1   0.963111
## G0         positive          G0      treat2   1.000982
## G2M        negative         G2M      treat1   0.992205

The aggregated values are then used to color the tissues represented in the spatial heatmap. An example aSVG file of mouse brain is included in spatialHeatmap, and spatial features are extracted.

svg.mus.brain <- system.file("extdata/shinyApp/example", "mus_musculus.brain.svg", package="spatialHeatmap")
feature.df <- return_feature(svg.path=svg.mus.brain)
tail(feature.df$feature) # Partial features are shown.
## [1] "brainstem"                    "midbrain"                    
## [3] "dorsal.plus.ventral.thalamus" "hypothalamus"                
## [5] "nose"                         "corpora.quadrigemina"

One collapsed cell cycle is equivalent to one aSVG feature.6 I don't understand this example. If cell cycles are shown whey would they be mapped to different tissues. Every biologist will be confused by this. Or perhaps I am misunderstanding. Before co-visualization, their matching relationship needs to be defined. In annotation-based and manual methods, the matching relationship is defined by users in a named list, while in the automatic method it is automatically defined, which is the essential difference across the three methods. Note, one collapsed cell label can be matched to multiple aSVG features but not vice versa. The following is an example named list.

lis.match.test <- list(G1=c('hypothalamus'), G0=c('brainstem', 'medulla.oblongata'))

The co-visualization is created on gene Gene_0010. The embedding plot includes all single cells before aggregation. The G1 and G0 cells are colored by their respecitive aggregated expression profiles (data=sce.aggr.test) and indicated in the legend at the bottom. In the nearby spatial heatmap plot, aSVG features are filled by the same color as the matching cells defined in the named list (lis.match.test). The cell.group argument indicates which column is considered as cell labels in the colData slot of sce.aggr.test.

shm.lis.test <- spatial_hm(svg.path=svg.mus.brain, data=sce.aggr.test, ID=c('Gene_0010'), height=0.7, legend.r=1.5, legend.key.size=0.02, legend.text.size=12, legend.nrow=3, sce.dimred=sce.dimred.test, dimred='PCA', cell.group='Cell_Cycle', assay.na='logcounts', tar.cell=c('matched'), lis.rematch=lis.match.test, bar.width=0.11, dim.lgd.nrow=1)
Example co-visualization plot. The co-visualization is created on gene `Gene_0010`. Single cells in the embedding plot and their matching aSVG features in the spatial heatmap are filled by the same color according to aggregated assay profiles within each cell label.

Figure 2: Example co-visualization plot
The co-visualization is created on gene Gene_0010. Single cells in the embedding plot and their matching aSVG features in the spatial heatmap are filled by the same color according to aggregated assay profiles within each cell label.

2.4 Annotation-based

The annotation method mainly takes advantage of annotation labels in existing SCE objects created by the community, such as the SCE objects from the R package scRNAseq (Risso and Cole 2022).

[ThG: the Shiny part makes no sense in this annotation section since mouse actions would fall under manual mode. There must be an upload option in the app.]

The following example uses single cell data from oligodendrocytes of mouse brain (Marques et al. 2016), which is downloaded in scRNAseq (Risso and Cole 2022) with minor changes. Prior to co-visualizing, the single cell data are pre-processed with standard QC and normalization methods. The details of these steps are described in the OSCA tutorial.

set.seed(10)

The single cell sample data set can be imported as follows.

sce.pa <- system.file("extdata/shinyApp/example", "sce_manual_mouse.rds", package="spatialHeatmap")
sce <- readRDS(sce.pa)

The quatity control (QC), normalization, and dimensionality reduction steps can each be performed with a single command (process_cell_meta).

[ThG: you need to at least briefly state what algorithms/method are used in each step.]

[Short version]

In QC, common per-cell metrics are cacluated such as library size, mitochodrial gene percentage, etc. Then problematic cells are filtered out according to these metrics. Refer to perCellQCMetrics and perCellQCFilters in the scuttle package for more details (McCarthy et al. 2017). In normalization, per-cell size factors are computed using a scaling normalization method followed by a deconvolution strategy. Single cells are finally normalized by these per-cell size factors. See more details in quickCluster, computeSumFactors from the scran package (A. T. L. Lun, McCarthy, and Marioni 2016), and logNormCounts from the scuttle package (McCarthy et al. 2017).

[Long version]

In the QC, frequently used per-cell metrics are calculated for identifying problematic cells, such as library size, number of detected features above a threshold, mitochodrial gene percentage, etc. Then these metrics are used to determine outlier cells based on median-absolute-deviation (MAD). Refer to perCellQCMetrics and perCellQCFilters in the scuttle package for more details (McCarthy et al. 2017). In the normalization, a quick-clustering method is applied to divide cells into clusters. Then a scaling normalization method is performed to obtain per-cluster size factors. Next, the size factor in each cluster is decomposed into per-cell size factors by a deconvolution strategy. Finally, all cells are normalized by per-cell size factors. See more details in quickCluster, computeSumFactors from the scran package (A. T. L. Lun, McCarthy, and Marioni 2016), and logNormCounts from the scuttle package (McCarthy et al. 2017).

In dimensionality reduction, the high-dimensional gene expression data are embedded into a 2-3 dimensional space using PCA, tSNE and UMAP. All three embedding result sets are stored in the SingleCellExperiment object. Details are seen in denoisePCA from scran (A. T. L. Lun, McCarthy, and Marioni 2016), and runUMAP, runTSNE from scater (McCarthy et al. 2017). Subsequently, the UMAP result is visualized as an example in form of a scatter plot where the dots are colored by the corresponding cell labels.

sce.dimred <- process_cell_meta(sce, qc.metric=list(subsets=list(Mt=rowData(sce)$featureType=='mito'), threshold=1))

plotUMAP(sce.dimred, colour_by="label")
Embedding plot single cells. The cells are colored by labels in the `label` column in `colData`.

Figure 3: Embedding plot single cells
The cells are colored by labels in the label column in colData.

The following command returns a slice of cell metadata in the colData slot of SCE. In principle, every column can be utilized as annotation labels for the purpose of co-visualization. In this example, the label and expVar columns are taken as annotation labels and experiment treatment variables respectively.

colData(sce.dimred)[25:27, ]
## DataFrame with 3 rows and 7 columns
##                              label         age inferred.cell.type         Sex
##                        <character> <character>        <character> <character>
## C1.1772125.078.A03 corpus.callosum         p21              MFOL1           M
## C1.1772096.168.C12    hypothalamus         p28              MFOL1           M
## C1.1772117.102.H11     dorsal.horn         p21              MFOL1           ?
##                         strain         expVar sizeFactor
##                    <character>    <character>  <numeric>
## C1.1772125.078.A03  PDGFRa-Cre        control   1.077659
## C1.1772096.168.C12      C57BL6 6h.post.stress   0.758306
## C1.1772117.102.H11      C57BL6        control   1.070574

[ThG: First, the previous is an incomplete sentence that should only be used in a title. Second, this section is confusing. What do you mean by 'aggregate'? Isn't this done already by having the annotations available? Aren't you using the existing annotations to compute a summary statistics for each predefined group of cells belonging to a tissue specified by the annotations? Subsequently, the summary values are used to define the colors in the spatial heatmap, correct? If so try to outline this more clearly. The same applies ot the help file of the aggr_rep function. The text in this help file is hard to follow.

Each annotation label in the label column of colData defines a cell group or cluster (sam.factor='label'). The expression profiles of each gene are aggregated by mean (aggr='mean') within each cell group. If experiment variables are provided, here con.factor='expVar', the aggregation will be performed on the new variable as a combination of label and expVar. The aggregation is carried out on normalized data indicated by assay.na='logcounts'.

sce.aggr <- aggr_rep(sce.dimred, assay.na='logcounts', sam.factor='label', con.factor='expVar', aggr='mean')

A slice of aggregated data is shown. hypothalamus and control is an annotation label and experiment treatment respectively. The abundance mean of gene Tcea1 under the combined variable hypothalamus__control is 1.440910.

as.matrix(logcounts(sce.aggr)[1:2, 1:2])
##         hypothalamus__control SN.VTA__control
## Tcea1                1.440910         1.62474
## Atp6v1h              2.275958         1.95121

Next, the spatial features are extracted with the return_feature function from an aSVG image of mouse brain. This sample aSVG image file is provided by the package.

svg.mus.brain <- system.file("extdata/shinyApp/example", "mus_musculus.brain.svg", package="spatialHeatmap")
feature.df <- return_feature(svg.path=svg.mus.brain)
tail(feature.df$feature)
## [1] "brainstem"                    "midbrain"                    
## [3] "dorsal.plus.ventral.thalamus" "hypothalamus"                
## [5] "nose"                         "corpora.quadrigemina"

[ThG: Why is do you select here only one feature/tissue instead of all annotated features. Also the need/utility of this list is unclear.]

In practice, the identifiers of aSVG features and annotation label counterparts are not exactly the same, since they might come from different sources, such as corpus.striatum in aSVG and corpus.callosum in label. To coordinate the differences and give users flexibilities, a named list is required to match cell labels and aSVG features, where cell labels are in the name slots and aSVG features are list elements. The example list is created on three tissues that have same or similar countterparts between aSVG features and cell labels. Note, one cell label could be matched to multiple aSVG features, not vice versa.

lis.match <- list(hypothalamus=c('hypothalamus'), cortex.S1=c('cerebral.cortex'), corpus.callosum=c('corpus.striatum'))

[ThG: there needs to be a detailed explanation or legend explaining what is shown on the plot. Again coloring only one tissue creates the impression that tissues need to be depicted one by one instead of a single plot. I would change this example to highlight most tissues. Most genes are not so selectively expressed to show up in only one tissue, so choosing here an example where this is the case shouldn't be hard.

Co-visualization is created on aggregated expression profiles (sce.aggr). Here gene Cops5 is chosen as an example. The data includes two experiment variables control and 6h.post.stress, thus covisualization plots are created under each condition. The legend under embedding plots shows the cell labels present in the matching list (lis.match). The cell population of the same label under the same condition are colored by their aggregated expression profiles, and the bulk tissue counterpart is colored by the same color in the anatomical image. In the downloaded single cell data, only hypothalamus is present under 6h.post.stress, so only hypothalamus cells and bulk tissues are colored under 6h.post.stress. The tar.cell argument specifies which cell populations to show. It can be one or multiple cell labels, and matched indicates all cell labels in the matching list.

shm.lis <- spatial_hm(svg.path=svg.mus.brain, data=sce.aggr, ID=c('Cops5'), height=0.7, legend.r=1.5, legend.key.size=0.02, legend.text.size=12, legend.nrow=2, sce.dimred=sce.dimred, dimred='PCA', cell.group='label', assay.na='logcounts', tar.cell=c('matched'), lis.rematch=lis.match, bar.width=0.09, dim.lgd.nrow=2)
Co-visualizing single cells and bulk tissues by annotation cell labels. The expression profiles of gene `Cops5` are used. Cell population of the same annotation label is colored by their aggregated expression profiles under the same condition, and the bulk tissue counterpart is colored by the same color.

Figure 4: Co-visualizing single cells and bulk tissues by annotation cell labels
The expression profiles of gene Cops5 are used. Cell population of the same annotation label is colored by their aggregated expression profiles under the same condition, and the bulk tissue counterpart is colored by the same color.

2.5 Manual

In the manual method, cell group or cluster labels are manually created according to criteria or methods chosen by users such as clustering methods, which is very flexible than the annotation method. The resultant manual clusters need to be stored in a tabular file, then added to the SingleCellExperiment by the manual_group function. The same single cell data and aSVG file as in the annotation method are used to demonstrate the manual method. The steps of quality control, normalization, and dimensionality reduction are the same with the annotation method, while a subsequent step of adding manual clusters to the SingleCellExperiment is required.

[ThG: I don't understand why the manual method uses clustering??? If clustering needs to be used here then it is unclear what clustering algorithim is applied.]

An example manual group file is included in spatialHeatmap, where group labels are created by the clustering function cluster_cell. Refer to the help file for how to use this function. At least two columns are required in this file. The cell column contains single cell identifiers present in the rownames of colData slot in the SingleCellExperiment, while the cluster column contains the manual cell group labels.

manual.clus.mus.sc.pa <- system.file("extdata/shinyApp/example", "manual_cluster_mouse_brain.txt", package="spatialHeatmap")
manual.clus.mus.sc <- read.table(manual.clus.mus.sc.pa, header=TRUE, sep='\t')
manual.clus.mus.sc[1:3, ]
##                 cell cluster
## 1 C1.1772078.029.F11   clus6
## 2 C1.1772089.202.E04   clus6
## 3 C1.1772099.091.D10   clus3

Manual group labels are included in a column in the colData slot by the function manual_group, which does not interfere with existing columns. The cell and cell.group arguments indicate two columns corresonding to cells and cell group labels respectively in manual.clus.mus.sc.

sce.clus <- manual_group(sce=sce.dimred, df.group=manual.clus.mus.sc, cell='cell', cell.group='cluster')
colData(sce.clus)[1:3, ]
## DataFrame with 3 rows and 8 columns
##                        cluster        label         age inferred.cell.type
##                    <character>  <character> <character>        <character>
## C1.1772078.029.F11       clus6 hypothalamus         p22              NFOL1
## C1.1772089.202.E04       clus6       SN.VTA         p22              NFOL2
## C1.1772099.091.D10       clus3  dorsal.horn         p20              NFOL2
##                                       Sex      strain      expVar sizeFactor
##                               <character> <character> <character>  <numeric>
## C1.1772078.029.F11                      F      C57BL6     control    1.22941
## C1.1772089.202.E04 pooled male and female         CD1     control    1.16755
## C1.1772099.091.D10                      ?      C57BL6     control    1.08944

Embedding plot of single cells colored by manual groups in the cluster column in colData.

plotUMAP(sce.clus, colour_by="cluster")
Embedding plot single cells. The cells are labeled by clusters in the `cluster` column in `colData`.

Figure 5: Embedding plot single cells
The cells are labeled by clusters in the cluster column in colData.

The same mouse brain aSVG as the annotation method is used.

tail(feature.df$feature)
## [1] "brainstem"                    "midbrain"                    
## [3] "dorsal.plus.ventral.thalamus" "hypothalamus"                
## [5] "nose"                         "corpora.quadrigemina"

Similar with the annotation method, create a named list to match manual cell cluster clus1 with aSVG feature hypothalamus, and cluster clus3 with cerebral.cortex and midbrain. The latter demonstrates one cell cluster is matched to multiple aSVG features.

lis.match.clus <- list('clus1'=c('cerebral.cortex'), 'clus3'=c('hypothalamus', 'midbrain'))

[ThG: same problem as before the aggregate part is unclear.]

Aggregate cells by taking average (aggr='mean') of gene expression profiles within each manual cluster in the cluster column (sam.factor='cluster'). To illustrate aggregation only on cells, the experiment variable is not considered (con.factor=NULL). assay.na='logcounts' indicates the aggregation is based on normalized data.

sce.clus.aggr <- aggr_rep(sce.clus, assay.na='logcounts', sam.factor='cluster', con.factor=NULL, aggr='mean')

Co-visualization plots is built on aggretated data (sce.clus.aggr). Here gene Rpl7 is chosen. The manual clusters in the matching list (lis.match.clus) are colored by within-cluster aggregation of Rpl7 abundance profiles, and indicated by the legends under the embedding plots. The same color of each matched cluster is used to paint the bulk tissue counterpart in the anatomical image.

shm.lis <- spatial_hm(svg.path=svg.mus.brain, data=sce.clus.aggr, ID=c('Rpl7'), height=0.7, legend.r=1.5, legend.key.size=0.02, legend.text.size=12, legend.nrow=3, sce.dimred=sce.clus, dimred='PCA', cell.group='cluster', assay.na='logcounts', tar.cell=c('matched'), lis.rematch=lis.match.clus, bar.width=0.09, dim.lgd.nrow=1)
Co-visualizing single cells and bulk tissues by manual cell clusters. The expression profiles of gene `Rpl7` are used. The matched manual clusters are colored by within-cluster aggregation of expression profiles and the same colors are used to color the bulk tissue counterparts in the anatomical image.

Figure 6: Co-visualizing single cells and bulk tissues by manual cell clusters
The expression profiles of gene Rpl7 are used. The matched manual clusters are colored by within-cluster aggregation of expression profiles and the same colors are used to color the bulk tissue counterparts in the anatomical image.

[ThG: I don't understand why this section is called 'Manual'?]

2.6 Automatic

The automatic method (Figure 1.2c) assigns source bulk tissues to individual cells in an automatic manner through a co-clustering algorithm (Figure 7). This section showcases this algorithm and its application on co-visualization.

2.6.1 Algorithm Overview

The most desirable input bulk and single cell data would be raw count data (e.g. RNA-seq) derived from the same organ or tissue (Figure 7.1). [ThG: why is that a requirement? You need to better explain what the requirements are here.] If same organ, the chosen bulk tissues are major or complete tissues comprising this organ, while the single cell data are assayed on the region covering at least the chosen bulk tissues or on the whole organ. Similarly, if chosen bulk tissues are sub-tissues composing partial or the complete tissue, the single cells need to cover at least the chosen sub-tissues or the whole tissue. The key point is bulk tissues and single cells have anatomical overlap. In the pre-processing step (Figure 7.2), the count data of bulk tissues and single cells are initially filtered respectively to remove the noisy data for the subsequent normalization, which is explained in the downstream Pre-processing step. The main difference between bulk and single cell data is the sparsity in the latter, where rows are molecules (genes, etc) and columns are bulk tissues or cells respectively. To reduce such difference, they are combined in a column-wise manner and normalized as a whole, then subsequently separated (Figure 7.2b-c).7 I don't understand the combined and separated part?

The log2-scale normalized bulk (Figure 7.3) and single-cell (Figure 7.4a) data are filtered at a second time respectively. To increase accuracy of bulk tissue assignments in the downstream co-clustering, the filtered single-cell data are clustered (Figure 7.4, A. T. L. Lun, McCarthy, and Marioni (2016), Csardi and Nepusz (2006)) and each resultant cluster is refined to obtain more homogeneous clusters (Figure 7.5). The clustering is performed on top dimensions that are produced by PCA or UMAP techniques (A. T. L. Lun, McCarthy, and Marioni 2016, McCarthy et al. (2017)), where one dimension is equivalent to one molecule (e.g. gene). When refining cell clusters, cells having low similarities (Spearman's or Pearson's correlation coefficients) with other cells in the same cluster are filtered, which are marked by "X" in Figure 7.5a. Specifically, in a cluster only cells having a minimum similarity with a minimum proportion of other cells in the same cluster are retained. As a result, each cell cluster is more homogeneous (Figure 7.5b). Next, filtered bulk (Figure 7.3) and refined single cell (Figure 7.5b) data are re-combined in column-wise manner and co-clustering is performed on top dimensions of the combined data (Figure 7.6) to produce co-clusters (Figure 7.7).8 The text implies that the clustering is performed on the embedding data of the single cell data. However, what data is used for the bulk is unclear.

There are three types of co-clusters: 1) Multiple bulk tissues are co-clustered with many cells.9 I don't understand the part with the three types of clusters. To me it seems you are describing something that is too trivial to be mentioned, the name choice clusters is confusing or I am missing your point in general. Similarities (Spearman's or Pearson's correlation coefficients) are calculated for any bulk-cell pair in the same co-cluster. If a bulk has the largest similarity with a specific cell than any other bulk, this bulk is assigned to this cell as source bulk. For example, in Figure 7.7a bulk A is assigned to cell a1 because a1 has higher similarity with A than B; 2). Only one bulk tissue is co-clustered with many cells. This bulk is assigned to all the cells in the same co-cluster (7.7b); and 3) No bulk is present in a cluster. All these cells are treated as un-labelled (Figure 7.7c), which are potential novel cell populations. These automatic bulk assignments are stored in a table (Figure 7.8) and are used for downstream co-visualization (Figure 1.2c).

Overview of co-clustering. (1) The input are raw count data of bulk tissues and single cells, such as RNA-seq count data. (2) Bulk and single cell data are initially filtered respectively, where rows are molecules (genes, *etc*) and columns are bulk tissues and cells respectively. Then the filtered bulk and cell data are combined in a column-wise manner and nornalized as a whole. After that, bulk and single cells data are seprated. (3) The normalized bulk data are filtered at a second time. (4) The normalized single cell data are filtered at a second time and then clustered on their top dimensions such as PCA of UMAP. (5) a. In each single cell cluster, cells having low similarities (Spearman's or Pearson's correlation coefficients) with other cells in the same cluster are filtered, which are indicated by "X". b. The resultant clusters are more homogeneous. (6) Re-combine separated bulk and refined single cell data and cluster top dimensions of the combined data. (7) a. Multiple bulk tissues are co-clustered with many cells. If a bulk tissue has the largest similarity with a certain cell, this bulk would be assigned to this cell. For example, bulk A is assigned to cell a1 because A has higher similarity with a1 than B. b. Only one bulk is co-clustered with many cells. This bulk is assigned to every cell in the same co-cluster. c. No bulk is co-clustered with cells. These cells are regarded as un-labelled. (8) The automatic bulk tissue assignments are stored in a table. ^[ThG: a description of the itemized steps 1-8 is missing. Each or at least ranges of steps need to be mentioned in the legend. In addtion, the illustration includes itemization with letters only for the right part. This needs to be done consitently. Moreover, the Annotation and Manual parts are essentially identical and their data structure is the same as under step 8 of the Automated method. This means you can much better summarize things by including this table once. This would be much more economic and less confusing since you would avoid a lot of duplication you have in the illustration. Moreover, the preprocessing part can be simplified, no one will understand the 'separating' part (see comment above).]

Figure 7: Overview of co-clustering
(1) The input are raw count data of bulk tissues and single cells, such as RNA-seq count data. (2) Bulk and single cell data are initially filtered respectively, where rows are molecules (genes, etc) and columns are bulk tissues and cells respectively. Then the filtered bulk and cell data are combined in a column-wise manner and nornalized as a whole. After that, bulk and single cells data are seprated. (3) The normalized bulk data are filtered at a second time. (4) The normalized single cell data are filtered at a second time and then clustered on their top dimensions such as PCA of UMAP. (5) a. In each single cell cluster, cells having low similarities (Spearman's or Pearson's correlation coefficients) with other cells in the same cluster are filtered, which are indicated by "X". b. The resultant clusters are more homogeneous. (6) Re-combine separated bulk and refined single cell data and cluster top dimensions of the combined data. (7) a. Multiple bulk tissues are co-clustered with many cells. If a bulk tissue has the largest similarity with a certain cell, this bulk would be assigned to this cell. For example, bulk A is assigned to cell a1 because A has higher similarity with a1 than B. b. Only one bulk is co-clustered with many cells. This bulk is assigned to every cell in the same co-cluster. c. No bulk is co-clustered with cells. These cells are regarded as un-labelled. (8) The automatic bulk tissue assignments are stored in a table.10 ThG: a description of the itemized steps 1-8 is missing. Each or at least ranges of steps need to be mentioned in the legend. In addtion, the illustration includes itemization with letters only for the right part. This needs to be done consitently. Moreover, the Annotation and Manual parts are essentially identical and their data structure is the same as under step 8 of the Automated method. This means you can much better summarize things by including this table once. This would be much more economic and less confusing since you would avoid a lot of duplication you have in the illustration. Moreover, the preprocessing part can be simplified, no one will understand the 'separating' part (see comment above).

To obtain optimal default settings for the automatic method (co-clustering), main parameters related to filtering, normalization, dimension reduction, clustering, and cell cluster refining are optimized and tested on real bulk and single cell datasets, which is detailed in an external vignette. The optimization utilities are included in spatialHeatmap, so users have the choice to optimize this method on their own real data. This section showcases application of the automatic method on mouse brain data with default settings obtained through optimization. The bulk RNA-seq data are generated in a research on the impact of placental endocrine on mouse cerebellar development (Vacher et al. 2021) and the scRNA-seq data are from a study of mouse brain molecular atlas (Ortiz et al. 2020). Both bulk and single cell data sets are modified for demonstration purpose.

2.6.2 Pre-processing

To obtain reproducible results, a fixed seed is set for generating random numbers.

set.seed(10)

The example bulk and single cell count data are included in spatialHeatmap and imported. Replicates of the same bulk tissue should have the same identifiers. For example, all replicates of Cerebral Cortex have the label CERE.CORTEX instead of CERE.CORTEX1, CERE.CORTEX2, and so on.

# Example bulk data.
blk.mus.pa <- system.file("extdata/shinyApp/example", "bulk_mouse_cocluster.txt", package="spatialHeatmap") 
blk.mus <- as.matrix(read.table(blk.mus.pa, header=TRUE, row.names=1, sep='\t', check.names=FALSE))
blk.mus[1:3, 1:5]
##          CERE.CORTEX HIPP HYPOTHA CERE CERE.CORTEX
## AI593442         177  256      50   24         285
## Actr3b           513 1465     228  244         666
## Adcy1            701 1243      57 1910         836
# Example single cell data.
sc.mus.pa <- system.file("extdata/shinyApp/example", "cell_mouse_cocluster.txt", package="spatialHeatmap") 
sc.mus <- as.matrix(read.table(sc.mus.pa, header=TRUE, row.names=1, sep='\t', check.names=FALSE))
sc.mus[1:3, 1:5]
##          isocort isocort isocort isocort olfa
## AI593442       2       2       2       5    0
## Actr3b         3       5       4       4    1
## Adcy1          3       6       6       6    0

Make sure bulk tissues and single cells do not share identifiers. Otherwise, the co-clustering results are not accurate.

intersect(colnames(blk.mus), colnames(sc.mus))
## character(0)

The raw count data of bulk tissues and single cells are initially filtered to remove noisy values for subsequent normalization. In bulk data, genes with counts over 5 across bulk tissues at a minimum proportion of 0.05 (pOA) and coefficient of variances (CV) between 0.05 and 100 are retained. Refer to filterfun in the geneflter package for more details (Gentleman et al. 2018). In single cell data, cells with a minimum count of 1 (min.cnt=1) in at least 1% (p.in.cell=0.01) of genes are retained. Genes with a minimum count of 1 (min.cnt=1) in at least 5% (p.in.gen=0.05) of cells are retained.

blk.mus.fil1 <- filter_data(data=blk.mus, pOA=c(0.05, 5), CV=c(0.05, 100), verbose=FALSE) 
mus.lis.fil1 <- filter_cell(lis=list(sc.mus=sc.mus), bulk=blk.mus.fil1, min.cnt=1, p.in.cell=0.01, p.in.gen=0.05, verbose=FALSE) 

The main different between bulk and single cell data is the sparsity in the latter. To reduce such difference, they are combined in a column-wise manner and normalized together, then separated. The whole process is a single function call on norm_multi.

mus.lis.nor <- read_cache(cache.pa, 'mus.lis.nor') 
if (is.null(mus.lis.nor)) { 
  mus.lis.nor <- norm_multi(dat.lis=mus.lis.fil1, cpm=FALSE)
  save_cache(dir=cache.pa, overwrite=TRUE, mus.lis.nor)
}

The log2-scale normalized data of bulk tissues and single cells are filtered at a second time respectively in a similar way as the initial filtering.

blk.mus.fil2 <- filter_data(data=mus.lis.nor$bulk, pOA=c(0.1, 1), CV=c(0.1, 50), verbose=FALSE) 
mus.lis.fil2 <- filter_cell(lis=list(sc.mus=mus.lis.nor$sc.mus), bulk=blk.mus.fil2, min.cnt=1, p.in.cell=0.1, p.in.gen=0.01, verbose=FALSE) 

The same aSVG file of mouse brain as the Quick Start is used. The process of extracting aSVG features is not shown.

tail(feature.df$feature) # Partial features are shown.
## [1] "brainstem"                    "midbrain"                    
## [3] "dorsal.plus.ventral.thalamus" "hypothalamus"                
## [5] "nose"                         "corpora.quadrigemina"

The matching between bulk and single cells are automatic, while the matching between aSVG features and bulk tissues needs to be defined by users in a data.frame. Since there is no uniform naming scheme for aSVG features or bulk tissues. In most cases, identifiers of aSVG features and bulk tissues are user dependent, especially when they are from different sources. The matching data.frame is utilized to coordinate differences of naming schemes between aSVG features and bulk tissues. In addition, it gives flexibility for users to choose desired identifiers.

The matching table for this example is included in spatialHeatmap and imported. The SVGBulk and dataBulk columns are aSVG features and bulk tissues respectively. Note, these two columns names are fixed so that the algorithm is able to recognize them.

[ThG: again users won't have this. The optimization belongs into the supplement.] [JZ: It is related to optimization.]

match.mus.brain.pa <- system.file("extdata/shinyApp/example", "match_mouse_brain_cocluster.txt", package="spatialHeatmap")
df.match.mus.brain <- read.table(match.mus.brain.pa, header=TRUE, row.names=1, sep='\t')
df.match.mus.brain
##           SVGBulk    dataBulk
## 1      cerebellum        CERE
## 2     hippocampus        HIPP
## 3 cerebral.cortex CERE.CORTEX
## 4     hippocampus        HIPP
## 5    hypothalamus     HYPOTHA

2.6.3 Co-clustering

The processes of clustering single cells (Figure 7.4), refining cell clusters (Figure 7.5), and co-clustering bulk and single cells (Figure 7.6-8) are performed in a single function call. Setting return.all=TRUE returns results in a list (res.lis).

res.lis <- read_cache(cache.pa, 'res.lis')
if (is.null(res.lis)) {
  res.lis <- coclus_meta(bulk=mus.lis.fil2$bulk, cell=mus.lis.fil2$sc.mus, df.match=df.match.mus.brain, return.all=TRUE, multi.core.par=MulticoreParam(workers=1), verbose=FALSE)
  res.lis <- res.lis[[1]]
  save_cache(dir=cache.pa, overwrite=TRUE, res.lis)
}

The source bulk tissue assignments are stored in the colData slot of sce.asg in the result list (res.lis), which are partialy shown below. The assignedBulk indicates assigned source bulk tissues for each cell, and the SVGBulk contains corresponding aSVG features, which is based on the matching table df.match.mus.brain. The predictor includes similarities (Spearman's correlation coefficient) between single cells and assigned bulk tissues.

colData(res.lis$sce.asg)[1:2, c('cell', 'assignedBulk', 'predictor', 'SVGBulk')]
## DataFrame with 2 rows and 4 columns
##             cell assignedBulk   predictor     SVGBulk
##      <character>  <character> <character> <character>
## olfa        olfa         HIPP       0.407 hippocampus
## hipp        hipp         HIPP       0.462 hippocampus

The co-cluster information is stored in the colData slot of sce.bulk.cell in the result list (res.lis). The cluster and cell columns indicate co-cluster labels and bulk/cell identifiers respectively.

colData(res.lis$sce.bulk.cell)[1:3, ]
## DataFrame with 3 rows and 2 columns
##                 cluster        cell
##             <character> <character>
## CERE.CORTEX      clus12 CERE.CORTEX
## HIPP             clus10        HIPP
## HYPOTHA          clus23     HYPOTHA

Each co-cluster consisting of bulk and cells can be visualized in an embedding plot. The following plot is the visualization of co-cluster clus11, where other bulk and cells are in gray.

plot_dim(res.lis$sce.bulk.cell, dim='PCA', color.by='cluster', group.sel='clus11')
Embedding plot of co-clusters. Dots represent bulk tissues or single cells. The bulk and cells in the target co-cluster are in blue while all other bulk and cells are in gray.

Figure 8: Embedding plot of co-clusters
Dots represent bulk tissues or single cells. The bulk and cells in the target co-cluster are in blue while all other bulk and cells are in gray.

[ThG: users would expect to see a plot next, yet a new subsection starts now that lacks context. Is this an oversight?]

2.6.4 Tailoring Co-clustering Results

The automatic (co-clustering) method is optimized and tested on a small number of real bulk and single cell datasets, due to the limited availability of single cell datasets where single cells are assayed on a whole organ or whole large tissue and each single cell is well annotated. Thus, the accuracy of automatic source bulk tissue assignments might be also limited. As a remedy, utilities are developed for tailoring the automatic bulk tissue assignments at user preferences. Note, the tailoring step is optional, if users are satisfied with the bulk assignments, just proceed to the co-visualization step.

The tailoring can be performed in command line or on a Shiny app. This section illustrates the command-line based tailoring. First visualize single cells after single-cell cluster refining (Figure 7.5) in an embedding plot as shown below. In order to define more accurate coordinates in the next step, tune the x-y axis breaks (x.break, y.break) and set panel.grid=TRUE.

plot_dim(res.lis$cell.refined, dim='UMAP', color.by='cell', x.break=seq(-10, 10, 1), y.break=seq(-10, 10, 1), panel.grid=TRUE)
PCA embedding plot of mouse brain single cell data. Single cells right after cluster refining are plotted.

Figure 9: PCA embedding plot of mouse brain single cell data
Single cells right after cluster refining are plotted.

Define desired bulk tissues (desiredSVGBulk) for cells selected by x-y coordinate ranges (x.min, x.max, y.min, y.max) in the embedding plot in form of a data.frame (df.desired.bulk). The dimred reveals where the coordinates come from and are required. In this example, cerebellum is the desired bulk tissue for cells in the selected region.

df.desired.bulk <- data.frame(x.min=c(1), x.max=c(3), y.min=c(3), y.max=c(4), desiredSVGBulk=c('hypothalamus'), dimred='UMAP')
df.desired.bulk
##   x.min x.max y.min y.max desiredSVGBulk dimred
## 1     1     3     3     4   hypothalamus   UMAP

Incorporate desired bulk assignments to co-clustering results by calling refine_asg. The predictors corresponding to desired bulk are internally set at the maximum of 1. The thr argument is a predictor threshold and used to filter bulk assignments.

[ThG: lack of context; hard to follow; does this need to be here?]

res.lis <- refine_asg(res.lis=res.lis, thr=0, df.desired.bulk=df.desired.bulk, df.match=df.match.mus.brain)

2.6.5 Co-visualization

Similar with the annotation-based and manual methods, cells under the same SVG feature (bulk tissue) are aggregated by averaging (aggr='mean') normalized assay profiles (assay.na='logcounts'). The aggregated assay profiles are used to color corresponding aSVG features.

[ThG: I am unable follow. Please provide clearer explantions to help readers what is done here or remove this part. The fact that you are not showing any output or visualization in this and other code sections makes it very hard to understand what is happening throughout.]

sce.aggr <- aggr_rep(data=res.lis$sce.asg, assay.na='logcounts', sam.factor='SVGBulk', con.factor=NULL, aggr='mean')

The co-visualization of bulk and single cells is built on aggregated gene abundance profiles (profile=TRUE) of gene Adcy1. All available bulk tissues (aSVG features) are selected as targets, i.e. hippocampus, hypothalamus, cerebellum and cerebral.cortex. In the embedding plot, cells matching the same target aSVG feature in the anatomical image are colored according to their aggregated assay profiles, and the same color is used to fill their matching aSVG features. The cells defined in df.desired.bulk in the tailoring section are also colored in the embedding plot and counted when aggregating expression profiles.

shm.lis1 <- spatial_hm(svg.path=svg.mus.brain, data=sce.aggr, ID=c('Adcy1'), legend.nrow=4, sce.dimred=res.lis$cell.refined, assay.na='logcounts', dimred='UMAP', tar.bulk=c('hippocampus', 'hypothalamus', 'cerebellum', 'cerebral.cortex'), profile=TRUE, dim.lgd.text.size=10, dim.lgd.nrow=2, bar.width=0.1)
Co-visualizing bulk and single cells of mouse brain with abundance profiles. The aggregated expression profile of gene `Adcy1` in cells matching the same bulk is used to fill these cells and bulk tissues.

Figure 10: Co-visualizing bulk and single cells of mouse brain with abundance profiles
The aggregated expression profile of gene Adcy1 in cells matching the same bulk is used to fill these cells and bulk tissues.

Bulk tissues and single cells are co-visualized without abundance profiles (profile=FALSE). All available bulk tissues (aSVG features) are selected to show, i.e. hippocampus, hypothalamus, cerebellum and cerebral.cortex. They are filled by different colors in the anatomical image, while the matching cells are indicated in the embedding plot with the same color as their source bulk tissues. Cells selected in the tailoring section are also colored in the embedding plot.

shm.lis2 <- spatial_hm(svg.path=svg.mus.brain, data=sce.aggr, legend.nrow=4, sce.dimred=res.lis$cell.refined, dimred='UMAP', tar.bulk=c('hippocampus', 'hypothalamus', 'cerebellum', 'cerebral.cortex'), profile=FALSE, dim.lgd.text.size=10, dim.lgd.nrow=2)
Co-visualizing bulk and single cells of mouse brain without abundance profiles. The matching between cells and source bulk tissues (aSVG features) is denoted by the same color between the embedding plot and anatomical image.

Figure 11: Co-visualizing bulk and single cells of mouse brain without abundance profiles
The matching between cells and source bulk tissues (aSVG features) is denoted by the same color between the embedding plot and anatomical image.

3 Using Shiny App

[ThG: this section would need to show a screenshot showing where this is used in the shiny app.]

The co-visualization feature is included in the integrated Shiny app that is an GUI implementation of spatialHeatmap, including annotation-based, maual, and automatic methods. To start this app, simply call shiny_shm() in R. Below is a screenshot of the co-visulization output generated by the automatic method.

Screenshot of the co-visualization output in Shiny app. The co-visualization plot is generated by the automatic method.

Figure 12: Screenshot of the co-visualization output in Shiny app
The co-visualization plot is generated by the automatic method.

When using the Shiny app, single cell data in annotation-based and manual methods or combined single cell and bulk data in automatic method are stored in a SingleCellExperiment object and saved in an .rds file by saveRDS, then the .rds file should be uploaded to the app. In the manual method, the manual clusters in a tabular file should be included in the colData slot by calling manual_group before creating an .rds file, as shown here. In the automatic method, bulk tissues and single cells are labeled by bulk and cell respectively in the bulkCell column in colData slot. The matching table between bulk tissues and aSVG features is stored in the metadata list with the name df.match. The example .rds file below illustrates these rules.

sce.auto <- readRDS(system.file("extdata/shinyApp/example", 'sce_auto_bulk_cell_mouse_brain.rds', package="spatialHeatmap"))
colData(sce.auto)
## DataFrame with 2219 rows and 1 column
##                bulkCell
##             <character>
## CERE.CORTEX        bulk
## HIPP               bulk
## HYPOTHA            bulk
## CERE               bulk
## CERE.CORTEX        bulk
## ...                 ...
## midbrain           cell
## midbrain           cell
## midbrain           cell
## hindbrain          cell
## midbrain           cell
metadata(sce.auto)$df.match
##           SVGBulk    dataBulk
## 1      cerebellum        CERE
## 2     hippocampus        HIPP
## 3 cerebral.cortex CERE.CORTEX
## 4     hippocampus        HIPP
## 5    hypothalamus     HYPOTHA

4 Supplementary Section

4.1 Assigning Desired Bulk on Shiny App

[ThG: I don't understand the above paragraph. The shiny part is totally out of context.]

This section describes tailoring co-clustering results on the convenience Shiny app, which is lauched by calling desired_bulk_shiny.

Figure 13 is the screenshot of the Shiny app. The file to upload is an .rds file of a SingleCellExperiment object saved by saveRDS. An example of how to generate such a file is seen in the help file of desired_bulk_shiny. On the left embedding plot, cells are selected with the "Lasso Select" tool. On the right, selected cells and their coordinates are listed in a table, and the desired bulk tissues (aSVG features) can be selected from the dropdown list, here cerebral.cortex. To download the table just click the "Download" button. The "Help" button gives more instructions.

Screenshot of the Shiny app for selecting desired bulk tissues. On the left is the embedding plot of single cells, where target cells are selected with the "Lasso Select" tool. On the right, desired bulk tissues are assigned for selected cell.

Figure 13: Screenshot of the Shiny app for selecting desired bulk tissues
On the left is the embedding plot of single cells, where target cells are selected with the "Lasso Select" tool. On the right, desired bulk tissues are assigned for selected cell.

An example of desired bulk downloaded from the convenience Shiny app is shown below. The x-y coordinates refer to single cells in embbeding plots (dimred). The df.desired.bulk is ready to use in the tailoring section.

desired.blk.pa <- system.file("extdata/shinyApp/example", "selected_cells_with_desired_bulk.txt", package="spatialHeatmap")
df.desired.bulk <- read.table(desired.blk.pa, header=TRUE, row.names=1, sep='\t')
df.desired.bulk[1:3, ]
##          x         y  key desiredSVGBulk dimred
## 1 1.403660 -7.061707 3294     cerebellum   UMAP
## 2 1.434507 -7.094477 3302     cerebellum   UMAP
## 3 1.383284 -7.072968 3311     cerebellum   UMAP


5 Version Informaion

sessionInfo()
## R version 4.2.0 (2022-04-22)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.5 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/libblas/libblas.so.3.6.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] kableExtra_1.3.4            htmltools_0.5.2            
##  [3] DT_0.22                     spatialHeatmap_2.1.3       
##  [5] BiocParallel_1.30.0         igraph_1.3.1               
##  [7] scater_1.24.0               ggplot2_3.3.6              
##  [9] scran_1.24.0                scuttle_1.6.0              
## [11] SingleCellExperiment_1.18.0 SummarizedExperiment_1.26.1
## [13] Biobase_2.56.0              GenomicRanges_1.48.0       
## [15] GenomeInfoDb_1.32.1         IRanges_2.30.0             
## [17] S4Vectors_0.34.0            BiocGenerics_0.42.0        
## [19] MatrixGenerics_1.8.0        matrixStats_0.62.0         
## [21] knitr_1.39                  BiocStyle_2.24.0           
## [23] nvimcom_0.9-25             
## 
## loaded via a namespace (and not attached):
##   [1] rappdirs_0.3.3            tidyr_1.2.0              
##   [3] bit64_4.0.5               irlba_2.3.5              
##   [5] DelayedArray_0.22.0       data.table_1.14.2        
##   [7] rpart_4.1.16              KEGGREST_1.36.0          
##   [9] RCurl_1.98-1.6            doParallel_1.0.17        
##  [11] generics_0.1.2            preprocessCore_1.58.0    
##  [13] ScaledMatrix_1.4.0        cowplot_1.1.1            
##  [15] RSQLite_2.2.13            bit_4.0.4                
##  [17] webshot_0.5.3             xml2_1.3.3               
##  [19] httpuv_1.6.5              assertthat_0.2.1         
##  [21] viridis_0.6.2             xfun_0.30                
##  [23] hms_1.1.1                 jquerylib_0.1.4          
##  [25] evaluate_0.15             promises_1.2.0.1         
##  [27] fansi_1.0.3               progress_1.2.2           
##  [29] caTools_1.18.2            dbplyr_2.1.1             
##  [31] DBI_1.1.2                 geneplotter_1.74.0       
##  [33] htmlwidgets_1.5.4         purrr_0.3.4              
##  [35] ellipsis_0.3.2            RSpectra_0.16-1          
##  [37] dplyr_1.0.9               backports_1.4.1          
##  [39] bookdown_0.26             annotate_1.74.0          
##  [41] sparseMatrixStats_1.8.0   vctrs_0.4.1              
##  [43] cachem_1.0.6              withr_2.5.0              
##  [45] checkmate_2.1.0           prettyunits_1.1.1        
##  [47] svglite_2.1.0             cluster_2.1.3            
##  [49] grImport_0.9-5            lazyeval_0.2.2           
##  [51] crayon_1.5.1              genefilter_1.78.0        
##  [53] labeling_0.4.2            edgeR_3.38.0             
##  [55] pkgconfig_2.0.3           vipor_0.4.5              
##  [57] nnet_7.3-17               rlang_1.0.2              
##  [59] lifecycle_1.0.1           filelock_1.0.2           
##  [61] BiocFileCache_2.4.0       rsvd_1.0.5               
##  [63] av_0.7.0                  rsvg_2.3.1               
##  [65] rngtools_1.5.2            Matrix_1.4-1             
##  [67] Rhdf5lib_1.18.0           base64enc_0.1-3          
##  [69] beeswarm_0.4.0            png_0.1-7                
##  [71] viridisLite_0.4.0         bitops_1.0-7             
##  [73] shinydashboard_0.7.2      KernSmooth_2.23-20       
##  [75] visNetwork_2.1.0          rhdf5filters_1.8.0       
##  [77] pROC_1.18.0               Biostrings_2.64.0        
##  [79] blob_1.2.3                DelayedMatrixStats_1.18.0
##  [81] doRNG_1.8.2               stringr_1.4.0            
##  [83] jpeg_0.1-9                gridGraphics_0.5-1       
##  [85] rols_2.24.0               beachmat_2.12.0          
##  [87] scales_1.2.0              memoise_2.0.1            
##  [89] magrittr_2.0.3            plyr_1.8.7               
##  [91] gplots_3.1.3              distinct_1.8.0           
##  [93] zlibbioc_1.42.0           compiler_4.2.0           
##  [95] dqrng_0.3.0               RColorBrewer_1.1-3       
##  [97] DESeq2_1.36.0             cli_3.3.0                
##  [99] XVector_0.36.0            htmlTable_2.4.0          
## [101] Formula_1.2-4             MASS_7.3-57              
## [103] WGCNA_1.71                tidyselect_1.1.2         
## [105] stringi_1.7.6             highr_0.9                
## [107] yaml_2.3.5                BiocSingular_1.12.0      
## [109] locfit_1.5-9.5            latticeExtra_0.6-29      
## [111] ggrepel_0.9.1             grid_4.2.0               
## [113] sass_0.4.1                tools_4.2.0              
## [115] parallel_4.2.0            rstudioapi_0.13          
## [117] bluster_1.6.0             foreach_1.5.2            
## [119] foreign_0.8-82            metapod_1.4.0            
## [121] gridExtra_2.3             farver_2.1.0             
## [123] Rtsne_0.16                digest_0.6.29            
## [125] BiocManager_1.30.17       FNN_1.1.3                
## [127] shiny_1.7.2               Rcpp_1.0.8.3             
## [129] later_1.3.0               RcppAnnoy_0.0.19         
## [131] httr_1.4.3                ggdendro_0.1.23          
## [133] AnnotationDbi_1.58.0      colorspace_2.0-3         
## [135] rvest_1.0.2               XML_3.99-0.9             
## [137] splines_4.2.0             uwot_0.1.11              
## [139] yulab.utils_0.0.4         statmod_1.4.36           
## [141] ggplotify_0.1.0           plotly_4.10.0            
## [143] systemfonts_1.0.4         xtable_1.8-4             
## [145] jsonlite_1.8.0            dynamicTreeCut_1.63-1    
## [147] UpSetR_1.4.0              flashClust_1.01-2        
## [149] R6_2.5.1                  Hmisc_4.7-0              
## [151] pillar_1.7.0              mime_0.12                
## [153] glue_1.6.2                fastmap_1.1.0            
## [155] BiocNeighbors_1.14.0      codetools_0.2-18         
## [157] utf8_1.2.2                lattice_0.20-45          
## [159] bslib_0.3.1               tibble_3.1.7             
## [161] curl_4.3.2                ggbeeswarm_0.6.0         
## [163] gtools_3.9.2              magick_2.7.3             
## [165] GO.db_3.15.0              survival_3.3-1           
## [167] limma_3.52.0              rmarkdown_2.14           
## [169] munsell_0.5.0             fastcluster_1.2.3        
## [171] rhdf5_2.40.0              GenomeInfoDbData_1.2.8   
## [173] iterators_1.0.14          HDF5Array_1.24.0         
## [175] impute_1.70.0             reshape2_1.4.4           
## [177] gtable_0.3.0

6 Funding

This project has been funded by NSF awards: PGRP-1546879, PGRP-1810468, PGRP-1936492.

7 References

Amezquita, Robert, Aaron Lun, Etienne Becht, Vince Carey, Lindsay Carpp, Ludwig Geistlinger, Federico Marini, et al. 2020. “Orchestrating Single-Cell Analysis with Bioconductor.” Nature Methods 17: 137–45. https://www.nature.com/articles/s41592-019-0654-x.

Chang, Winston, and Barbara Borges Ribeiro. 2018. Shinydashboard: Create Dashboards with ’Shiny’. https://CRAN.R-project.org/package=shinydashboard.

Chang, Winston, Joe Cheng, JJ Allaire, Cars on Sievert, Barret Schloerke, Yihui Xie, Jeff Allen, Jonathan McPherson, Alan Dipert, and Barbara Borges. 2021. Shiny: Web Application Framework for R. https://CRAN.R-project.org/package=shiny.

Csardi, Gabor, and Tamas Nepusz. 2006. “The Igraph Software Package for Complex Network Research.” InterJournal Complex Systems: 1695. http://igraph.org.

Gentleman, R, V Carey, W Huber, and F Hahne. 2018. “Genefilter: Methods for Filtering Genes from High-Throughput Experiments.” http://bioconductor.uib.no/2.7/bioc/html/genefilter.html.

Lun, Aaron T. L., Davis J. McCarthy, and John C. Marioni. 2016. “A Step-by-Step Workflow for Low-Level Analysis of Single-Cell RNA-Seq Data with Bioconductor.” F1000Res. 5: 2122. doi:10.12688/f1000research.9501.2.

Marques, Sueli, Amit Zeisel, Simone Codeluppi, David van Bruggen, Ana Mendanha Falcão, Lin Xiao, Huiliang Li, et al. 2016. “Oligodendrocyte Heterogeneity in the Mouse Juvenile and Adult Central Nervous System.” Science 352 (6291): 1326–9.

McCarthy, Davis J., Kieran R. Campbell, Aaron T. L. Lun, and Quin F. Willis. 2017. “Scater: Pre-Processing, Quality Control, Normalisation and Visualisation of Single-Cell RNA-Seq Data in R.” Bioinformatics 33 (8): 1179–86. doi:10.1093/bioinformatics/btw777.

Ortiz, Cantin, Jose Fernandez Navarro, Aleksandra Jurek, Antje Märtin, Joakim Lundeberg, and Konstantinos Meletis. 2020. “Molecular Atlas of the Adult Mouse Brain.” Science Advances 6 (26). American Association for the Advancement of Science: eabb3446.

Risso, Davide, and Michael Cole. 2022. ScRNAseq: Collection of Public Single-Cell RNA-Seq Datasets.

Vacher, Claire-Marie, Helene Lacaille, Jiaqi J O’Reilly, Jacquelyn Salzbank, Dana Bakalar, Sonia Sebaoui, Philippe Liere, et al. 2021. “Placental Endocrine Function Shapes Cerebellar Development and Social Behavior.” Nat. Neurosci. 24 (10). Nature Publishing Group: 1392–1401.

Zhang, Jianhai, Jordan Hayes, Le Zhang, Bing Yang, Wolf Frommer, Julia Bailey-Serres, and Thomas Girke. 2022. SpatialHeatmap: SpatialHeatmap. https://github.com/jianhaizhang/spatialHeatmap.